Mapping Rules for Building a Tunisian Dialect Lexicon and Generating Corpora

نویسندگان

  • Rahma Boujelbane
  • Mariem Ellouze
  • Lamia Hadrich Belguith
چکیده

Nowadays in tunisia, the arabic Tunisian Dialect (TD) has become progressively used in interviews, news and debate programs instead of Modern Standard Arabic (MSA). Thus, this gave birth to a new kind of language. Indeed, the majority of speech is no longer made in MSA but alternates between MSA and TD. This situation has important negative consequences on Automatic Speech Recognition (ASR): since the spoken dialects are not officially written and do not have a standard orthography, it is very costly to obtain adequate annotated corpora to use for training language models and building vocabulary. There are neither parallel corpora involving Tunisian dialect and MSA nor dictionaries. In this paper, we describe a method for building a bilingual dictionary using explicit knowledge about the relation between TD and MSA. We also present an automatic process for creating Tunisian Dialect

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Building bilingual lexicon to create Dialect Tunisian corpora and adapt language model

Since the Tunisian revolution, Tunisian Dialect (TD) used in daily life, has became progressively used and represented in interviews, news and debate programs instead of Modern Standard Arabic (MSA). This situation has important negative consequences for natural language processing (NLP): since the spoken dialects are not officially written and do not have standard orthography, it is very costl...

متن کامل

Morphological Analysis of Tunisian Dialect

In this paper, we address the problem of the morphological analysis of an Arabic dialect. We propose a method to adapt an Arabic morphological analyzer for the Tunisian dialect (TD). In order to do that, we create a lexicon for the TD. The creation of the lexicon is done in two steps. The first step consists in adapting a Modern Standard Arabic (MSA) lexicon. We adapted a list of MSA derivation...

متن کامل

A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition

In this paper we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus) has a collection of audio recordings and transcriptions from dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that s...

متن کامل

Collaboratively Constructed Linguistic Resources for Language Variants and their Exploitation in NLP Application - the case of Tunisian Arabic and the Social Media

Modern Standard Arabic (MSA) is the formal language in most Arabic countries. Arabic Dialects (AD) or daily language differs from MSA especially in social media communication. However, most Arabic social media texts have mixed forms and many variations especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for the translation of texts of soc...

متن کامل

The Effects of Factorizing Root and Pattern Mapping in Translating between Tunisian Arabic and Standard Arabic

The development of natural language processing tools for dialects faces the severe problem of lack of resources. In cases of diglossia, as in Arabic, one variant, Modern Standard Arabic (MSA), has many resources that can be used to build natural language processing tools. Whereas other variants, Arabic dialects, are resource poor. Taking advantage of the closeness of MSA and its dialects, one w...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013